Voice control hub methods and systems

ABSTRACT

A method for application voice access and control includes receiving a handler registration, receiving an utterance, transmitting the utterance to a cloud layer, receiving an intent and an entity from the cloud layer, and dispatching the intent and the entity to the handler. A voice control hub system includes a processor and a memory storing instructions that, when executed by the processor, cause the system to receive a handler registration, receive an utterance, transmit the utterance to a cloud layer, receive an intent and an entity from the cloud layer, and dispatch the intent and the entity to the handler. A non-transitory computer readable medium includes program instructions that, when executed, cause a computer to receive a handler registration, receive an utterance, transmit the utterance to a cloud layer, receive an intent and an entity from the cloud layer, and dispatch the intent and the entity to the handler.

FIELD OF THE DISCLOSURE

The present disclosure is generally directed to automated graphical user interface control methods and systems and, more particularly, to techniques for accessing and controlling graphical user interfaces using vocal commands uttered by a user of an application.

BACKGROUND

Existing enterprise systems used to manage an organization's relationships and interactions with customers (e.g., a Customer Relationship Management (CRM) system or a Quantitative Risk Management (QRM) system) require employees to perform repetitive tasks. Existing consumer voice-based systems such as Google Assistant, Amazon Alexa, etc. are proprietary systems that do not allow an enterprise to meaningfully integrate those systems with internal enterprise systems. Such consumer systems have several drawbacks. First, consumer systems do not include a public application programming interface (API). Second, consumer systems are geared toward consumer development and require sensitive data to transit networks and data centers controlled by third parties. Using such networks and data centers represents security and data privacy risks. Such systems are simply not intended to access enterprise data. Third, consumer systems base understanding of utterances on template programming. Therefore, such systems are brittle and unable to process slight variations in speech. When voice response is based upon templates, it is generally not customizable for specific business use cases/contexts. Fourth, consumer systems do not allow a user (e.g., a developer) to access integrated functions in a process located within an enterprise computing environment, or to keep such processes located on the premises of the enterprise. Rather, the consumer system owner requires that all access occur in a captive cloud environment. An enterprise may choose to exempt certain network access through the network firewall of the enterprise, but doing so weakens security across the enterprise, and other teams may object (e.g., an InfoSec team). Fifth, any reaction to a spoken utterance is generated by the consumer system, not a local process. Sixth, the requirement to access remote cloud-based resources causes latencies in processing that may be frustrating to end users.

In many cases, consumer voice system APIs are rigid and cannot be integrated with legacy applications due to a mismatch between the programming capabilities of the consumer voice systems and the older applications. A further drawback of existing voice systems is that they provide no facilities to allow a user to specify custom voice commands, because response actions are hard-coded. In other words, whatever functionality the owner of the consumer voice system chose is available to the user, and the user cannot express any other voice commands. The user has no ability to change fundamental design decisions, such as the method for accessing the consumer voice system. Moreover, as business imperatives shift over time, the set of available commands may become quickly outdated. Further, existing systems require developers to write code to implement voice functionality, which is time-consuming and error-prone. Existing systems do not include any facilities that allow end users to create code-free voice functionality.

BRIEF SUMMARY

In one aspect, a computer-implemented method of enabling voice functionality in an application includes receiving a handler registration request specifying an object handler to respond to voice commands, receiving an utterance of a user, transmitting the utterance of the user to a remote cloud services layer, receiving an intent and an entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatching the intent and the entity to the object handler.

In another aspect, a voice control hub computing system includes one or more processors and a memory storing instructions that, when executed by the one or more processors, cause the system to receive a handler registration request specifying an object handler to respond to voice commands, receive an utterance of a user, transmit the utterance of the user to a remote cloud services layer, receive an intent and an entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatch the intent and the entity to the object handler.

In yet another aspect, a non-transitory computer readable medium includes program instructions that, when executed, cause a computer to receive a handler registration request specifying an object handler to respond to voice commands, receive an utterance of a user, transmit the utterance of the user to a remote cloud services layer, receive an intent and an entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatch the intent and the entity to the object handler.

BRIEF DESCRIPTION OF THE DRAWINGS

The figures described below depict various aspects of the system and methods disclosed therein. It should be understood that each figure depicts one embodiment of a particular aspect of the disclosed system and methods, and that each of the figures is intended to accord with a possible embodiment thereof. Further, wherever possible, the following description refers to the reference numerals included in the following figures, in which features depicted in multiple figures are designated with consistent reference numerals.

FIG. 1 depicts an exemplary computing environment in which techniques for voice control of enterprise systems may be implemented, according to one embodiment.

FIG. 2 depicts a conceptual model for voice control of enterprise systems, according to an embodiment.

FIG. 3 depicts a schematic class diagram for voice control of an application, according to an embodiment.

FIG. 4 depicts an example graphical user interface for allowing a user to create an order, according to an embodiment.

FIG. 5 depicts an example convolutional neural network, according to an embodiment.

FIG. 6 depicts an example graphical user interface for setting a dynamically-compiled handler, according to an embodiment.

FIG. 7 depicts a flow diagram of a method of visual programming for intent dispatch, according to one embodiment and scenario.

FIG. 8 depicts an action palette graphical user interface, according to an embodiment.

FIG. 9 depicts an example flow diagram of a method for enabling voice functionality in an application, according to an embodiment.

The figures depict preferred embodiments for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the systems and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION

Overview

The embodiments described herein relate to, inter alia, techniques for voice control of enterprise systems. In an embodiment, voice control hub methods and systems implement components for adding voice control capabilities to an application and/or business objects. Another embodiment relates to automated graphical user interface (GUI) control methods and systems using voice commands. Yet another embodiment includes visual programming methods and systems for intent dispatch. Another embodiment includes the use of a computer vision technique (e.g., a trained convolutional neural network (CNN)) to identify graphical user interface controls for use in, inter alia, voice control applications. The embodiments disclosed herein allow enterprise employees (e.g., a user, administrator, manager, programmer, etc.) to, inter alia, add cloud-based voice functionality to business objects, retrofit legacy applications to include voice capabilities, automate applications using voice capabilities for robotic process automation (RPA), and create palettes of actions for intents.

Exemplary Computing Environment

FIG. 1 depicts an exemplary computing environment 100 in which the techniques disclosed herein may be implemented, according to an embodiment. The enterprise environment 100 includes a computing device 102, a network 104, and a cloud 106. Some embodiments may include a plurality of computing devices 102. The enterprise environment 100 may include a logical separation, corresponding to a respective private enterprise logical portion and a public logical portion.

Generally, the computing device 102 is located in the private enterprise logical portion of the environment 100, as demarcated by an enterprise firewall 108 of FIG. 1. For example, the enterprise logical portion may correspond to a networking subgroup that is located behind the enterprise firewall 108. The enterprise firewall 108 may be established by, for example, a software and/or hardware router.

The computing device 102 may be an individual server, a group (e.g., cluster) of multiple servers, or another suitable type of computing device or system (e.g., a collection of computing resources). For example, the computing device 102 may be any suitable computing device (e.g., a server, a mobile computing device, a smart phone, a tablet, a laptop, a wearable device, etc.). In some embodiments, one or more components of the computing device 102 may be provided by virtual instances (e.g., cloud-based virtualization services). In such cases, one or more computing device 102 may be included in the public logical portion of the computing environment 100. One or more computing device 102 located in the enterprise portion of the computing environment 100 may be linked to one or more computing device 102 located in the public portion of the environment 100 via the network 104.

The network 104 may be a single communication network, or may include multiple communication networks of one or more types (e.g., one or more wired and/or wireless local area networks (LANs), and/or one or more wired and/or wireless wide area networks (WANs) such as the Internet). The network 104 may enable bidirectional communication between the computing device 102 and the cloud 106, or between multiple computing devices 102, for example. The network 104 may be located within the enterprise logical portion and/or within the public logical portion of the computing environment 100.

The computing device 102 includes a processor 110 and a network interface controller (NIC) 112. The processor 110 may include any suitable number of processors and/or processor types, such as CPUs and one or more graphics processing units (GPUs). Generally, the processor 110 is configured to execute software instructions stored in a memory 114. The memory 114 may include one or more persistent memories (e.g., a hard drive/solid state memory) and stores one or more set of computer executable instructions/modules, including a speech-to-text module 116, an intent module 118, a framework module 120, an event handler module 122, a machine learning (ML) training module 124, an ML operation module 126, etc., as described in more detail below.

The NIC 112 may include any suitable network interface controller(s), such as wired/wireless controllers (e.g., Ethernet controllers), and facilitate bidirectional/multiplexed networking over the network 104 between the computing device 102 and other components of the environment 100 (e.g., another computing device 102, the cloud 106, etc.).

The memory 114 includes one or more modules for implementing specific functionality. For example, in an embodiment, the memory 114 includes a speech-to-text module 116 for accessing a speech-to-text API of a cloud computing environment (e.g., the cloud 106). The speech-to-text module 116 includes computer-executable instructions for receiving audio speech data from an input device of the computing device 102, and for transmitting the audio speech data to the speech-to-text API via the network 104. The speech-to-text module 116 may include instructions for receiving audio speech data periodically, continuously, and/or in response to a user action via an input device of the computing device 102. The speech-to-text module 116 may receive textual output corresponding to the transmitted audio speech data. The speech-to-text module 116 may receive a string of structured data (e.g., JSON) including the textual output produced by the speech-to-text API, and pass the string to the intent module 118.

For example, the audio speech data may include a user query. Herein, a “user” may be a developer of an enterprise application, an end user of that application (e.g., a business person), etc. The speech-to-text API may translate the audio speech data to a string query (e.g., “Who is Joe's manager?”) and forward the string query to the intent module 118. The string query may be provided by the user as a spoken utterance, as a typed phrase, etc. Additional examples of user queries may include, inter alia, a command to swipe the user in (“Swipe me in.”), a command to query vacation balance (“What is my vacation balance?”), a command to find a phone extension of a person (“What is the phone extension for Bob Smith?”), a request for personnel information (“Who is the manager for Bob Smith?”), etc. Each query may receive a response from the cloud 106. For example, the response to the above user query regarding vacation balance may include “Your vacation balance is 3 weeks, 7 hours”. In some cases, the response may be a number of seconds that may be converted to a number of weeks and hours. The telephone extension response may be a telephone extension number (e.g., 6543). The response to an organizational query may be a tree structure such as a node of a manager (e.g., Jane Doe) having a leaf node of a subordinate employee (e.g., Bob Smith). In each case, the intent module 118 may include a set of computer-executable instructions for analyzing the query response.
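By way of illustration, the following C# listing is a minimal sketch of the speech-to-text module's round trip described above. The endpoint URL, payload shape, and response field names are assumptions provided for example only and are not part of any particular cloud API.

using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical sketch of the speech-to-text module (116): send recorded audio to a
// cloud speech-to-text endpoint and return the transcript for the intent module (118).
public class SpeechToTextModule
{
    private readonly HttpClient _http = new HttpClient();
    private readonly string _endpoint = "https://cloud.example.com/speech-to-text"; // assumed URL

    // audioWav: e.g., WAV bytes captured from the input device 140.
    public async Task<string> TranscribeAsync(byte[] audioWav)
    {
        using var content = new ByteArrayContent(audioWav);
        content.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");

        HttpResponseMessage response = await _http.PostAsync(_endpoint, content);
        response.EnsureSuccessStatusCode();

        // The endpoint is assumed to return JSON such as {"text": "who is joe smith's manager?"}.
        using JsonDocument doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement.GetProperty("text").GetString();
    }
}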

The intent module 118 may include computer-executable instructions for transmitting a request, and receiving and/or retrieving a response from a command interpreter API of the cloud 106. The request may include a textual string, such as the textual output received by the speech-to-text module 116 (e.g., the string of structured data). The intent module 118 may retrieve a subpart of structured data (e.g., a hash value) from the structured data corresponding to the text of the vocal utterance (e.g., “Who is Joe's manager?”). The intent module 118 may include the text in a request to the command interpreter API. The intent module 118 may retrieve/receive the response from the command interpreter API and forward the response to the framework module 120 and/or the event handler 122 for further processing. Continuing the above example, the command interpreter API may return a response such as the following:

{
  "query": "who is joe smith's manager?",
  "topScoringIntent": {
    "intent": "Query_ManagerName",
    "score": 0.797840059
  },
  "intents": [
    { "intent": "Query_ManagerName", "score": 0.797840059 },
    { "intent": "Query_ManagerOfficePhone", "score": 0.5884016 },
    { "intent": "Action_GoToManagerByName", "score": 0.4812061 },
    { "intent": "Query_CellPhone", "score": 0.4402233 },
    { "intent": "Query_HomeAddress", "score": 0.201515391 },
    { "intent": "Query_HomePhone", "score": 0.119886041 },
    { "intent": "Query_StartDate", "score": 0.09225527 },
    { "intent": "Query_EmpID", "score": 0.04686539 },
    { "intent": "Query_Title", "score": 0.0423500538 },
    { "intent": "Query_Department", "score": 0.028601585 },
    { "intent": "Query_OfficePhone", "score": 0.0189168919 },
    { "intent": "Query_UsedVacation", "score": 0.0107963076 },
    { "intent": "None", "score": 0.0105551314 },
    { "intent": "Query_AnyMeetingToday", "score": 0.008343225 },
    { "intent": "Action_GoToCoworkerByName", "score": 0.005655013 },
    { "intent": "Query_AccruedSickTime", "score": 0.00395378657 },
    { "intent": "Query_Birthday", "score": 0.0039028204 },
    { "intent": "Query_UsedSickTime", "score": 0.00316411234 },
    { "intent": "Query_CoworkerPhoneExt", "score": 0.00306525687 },
    { "intent": "Action_GoToCoworkerByID", "score": 0.0018620895 },
    { "intent": "Query_AccruedVacation", "score": 0.00123780384 },
    { "intent": "Action_SwipeIn", "score": 0.0006449005 },
    { "intent": "Action_SwipeOut", "score": 0.0006220838 }
  ],
  "entities": [
    {
      "entity": "joe smith",
      "type": "builtin.personName",
      "startIndex": 7,
      "endIndex": 17
    },
    {
      "entity": "joe smith",
      "type": "Pronoun",
      "startIndex": 7,
      "endIndex": 17,
      "resolution": {
        "values": [ "she" ]
      }
    }
  ],
  ...
}

The response includes a subpart including a top scoring intent, Query_ManagerName, which the command interpreter API scores as most likely corresponding to the intent of the user query. The response also includes a list of other intents and respective scores that the intent module 118 may analyze. For example, in an embodiment, the framework module 120 may use the second, third, etc. highest-scored intents to determine whether the user's intent is different from the top scored intent, or use the results for other purposes. The response also includes one or more entity corresponding to an entity identified by the command interpreter API. The entity may have a type, such as personName. The framework module 120 may analyze the entity and take action based on the entity type.
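For illustration, the following C# sketch extracts the top-scoring intent and the entities from a response shaped like the example above. The property names follow that sample; the helper class and method names are assumptions.

using System;
using System.Text.Json;

// Minimal sketch: parse the command interpreter response shown above.
public static class IntentResponseParser
{
    public static (string Intent, double Score) TopIntent(string responseJson)
    {
        using JsonDocument doc = JsonDocument.Parse(responseJson);
        JsonElement top = doc.RootElement.GetProperty("topScoringIntent");
        return (top.GetProperty("intent").GetString(), top.GetProperty("score").GetDouble());
    }

    public static void PrintEntities(string responseJson)
    {
        using JsonDocument doc = JsonDocument.Parse(responseJson);
        foreach (JsonElement e in doc.RootElement.GetProperty("entities").EnumerateArray())
        {
            // e.g., "joe smith (builtin.personName)"
            Console.WriteLine($"{e.GetProperty("entity").GetString()} ({e.GetProperty("type").GetString()})");
        }
    }
}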

It should be appreciated that the response may include one or more intent associated with a respective one or more entity. This is an important distinction for some embodiments, such as compound command embodiments wherein a user issues two commands in a single utterance (e.g., “Set account manager to Jim's supervisor and set shipping to overnight”). In such an example, the response may include the identity of Jim's supervisor (e.g., Jill). The command interpreter API may then return a nested data structure including the two intents (“set account manager” and “set shipping class”) and their respective entities (“Jill” and “Overnight”).

Each entity in the command interpreter API may be associated with a respective type and a respective set of labeled utterances. For example, a set of entities in the command interpreter API may include a CoworkerID, an EnglishName, a NonEnglishFirstName, a NonEnglishLastName, a NonEnglishName, an OrderNumber, a Pronoun, a personName, etc. Entity types include a regular expression type, a composite type, a simple type, a list type, and a prebuilt type, for example.

Generally, the framework module 120 enables, inter alia, registration of a handler with respect to an object. For example, a user may register a handler for any object, wherein the object is a business object, a GUI element, etc. Providing a link between a voice command and an internal enterprise-level business object, which may even be a hidden business object, is an advantage of the present techniques. Once the handler is registered, the event handler 122 receives the forwarded response from the intent module 118, which contains one or more command (e.g., intent) and/or an entity. The event handler 122 processes the command in accordance with the registration performed by the framework module 120.
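A minimal sketch of such registration and dispatch is shown below. The type and method names (VoiceIntent, RegisterHandler, Dispatch) are assumptions used for illustration only; the disclosure does not prescribe a specific signature.

using System;
using System.Collections.Generic;

// Hypothetical decoded intent passed from the intent module to registered handlers.
public record VoiceIntent(string Name, double Score, IReadOnlyList<string> Entities);

// Hypothetical sketch of handler registration through the framework module (120):
// an application object registers a callback that the event handler (122) invokes
// when a matching intent arrives.
public class VoiceFramework
{
    private readonly Dictionary<string, Action<VoiceIntent>> _handlers = new();

    // Register a handler for a given intent name (e.g., "Query_ManagerName").
    public void RegisterHandler(string intentName, Action<VoiceIntent> handler)
        => _handlers[intentName] = handler;

    // Called when the intent module forwards a decoded response.
    public void Dispatch(VoiceIntent intent)
    {
        if (_handlers.TryGetValue(intent.Name, out var handler))
            handler(intent);
    }
}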

The ML training module 124 enables the environment 100 to train one or more ML model. For example, the ML training module 124 may include instructions for creating a new model, loading data, labeling data, and performing training operations such as gradient descent, etc. The ML training module 124 may include instructions for serializing a model, and saving the model to a disk (e.g., the memory 114), including any weights associated with the trained ML model. The ML training module 124 may include instructions for cross-validation of a trained model, in addition to instructions for separating data sets into training and validation data sets.

The ML operation module 126 enables the environment 100 to operate ML models trained by the ML training module 124 or another ML training system. The ML operation module includes instructions for loading a serialized model from a disk or a database (e.g., a deserialization instruction), and for initializing the trained ML model with saved weights. The ML operation module may include instructions for communicating an ML operation status to another module.

In an embodiment, the framework 120 and/or event handler 122 may be encapsulated in a voice control hub, which is a self-contained module that can be added to any application, including legacy applications, to add voice functionality. In some embodiments, the framework and/or the voice control hub may be implemented in a programming language/framework such as Microsoft .NET. In some embodiments, the voice control hub and application(s) may execute in a common process space. The voice control hub and application(s) may be implemented in a common technology (e.g., a .NET-compatible technology) or different technologies linked by a distributed computing interface (e.g., a Remote Procedure Call). The voice control hub may implement one or more design pattern (e.g., an object-oriented proxy pattern) that allows an application to implement a voice intent handler to provide natural language understanding to any existing or new application. The framework 120 may include instructions for registering a set of instructions for execution when a particular type of intent (e.g., SWIPE IN) is received. The voice control hub includes all functionality necessary to access cloud APIs such as the speech-to-text API and command interpreter API discussed above. In this way, the voice control hub links remote cloud API resources to an application in a way that separates the concerns of the application from the concerns of accessing remote APIs. A benefit of such separation of concerns is simplifying development of the application. Another benefit is allowing developers to improve and change the voice control hub without requiring any changes to the application using the voice control hub. Other benefits are envisioned, including allowing the use of newer technologies that may not be directly integrated with legacy applications.

The computing device 102 may include a data store 130, which may be implemented as a relational database management system (RDBMS) in some embodiments. For example, the data store 130 may include one or more structured query language (SQL) database, a NoSQL database, a flat file storage system, or any other suitable data storage system/configuration. In general, the data store 130 allows the computing device 102 to create, retrieve, update, and/or delete records relating to performance of the techniques herein. For example, the data store 130 may allow the computing device 102 to retrieve human resources (HR) records relating to an enterprise employee (e.g., an employee's position in an organizational chart). The data store 130 may include a Lightweight Directory Access Protocol (LDAP) directory, in some embodiments. The computing device 102 may include a module (not depicted) including a set of instructions for querying an RDBMS, an LDAP server, etc. In some embodiments, the data store 130 may be located remotely from the computing device 102, in which case the computing device 102 may access the data store 130 via the NIC 112 and the network 104.

The computing device 102 may include an input device 140 and an output device 142. The input device 140 may include any suitable device or devices for receiving input, such as one or more microphone, one or more camera, a hardware keyboard, a hardware mouse, a capacitive touch screen, etc. The output device 142 may include any suitable device for conveying output, such as a hardware speaker, a computer monitor, a touch screen, etc. In some cases, the input device 140 and the output device 142 may be integrated into a single device, such as a touch screen device that accepts user input and displays output.

The computing device 102 may be associated with (e.g., owned/operated by) a company that creates voice-based functionality for the performance of tasks within enterprise business software (e.g., CRM software, QRM software, etc.).

Generally, the cloud 106 may include one or more APIs implemented as endpoints accessible via a web service protocol, such as representational state transfer (REST), Simple Object Access Protocol (SOAP), JavaScript Object Notation (JSON), etc. In an embodiment, the one or more APIs may be implemented as one or more modules including respective sets of computer-executable instructions for performing functions related to voice control of enterprise computing systems. For example, in the depicted embodiment, the cloud 106 includes a speech-to-text API 150 for transforming speech audio to text and a command interpreter API 152 for analyzing text to determine one or more intent and/or one or more entity.

The cloud 106 may be located at a private data center or a data center owned by a cloud computing provider. The one or more APIs of the cloud 106 may be implemented in virtual machine instances of a cloud provider rented, leased, or otherwise controlled by a corporation controlling the computing device 102. The one or more APIs of the cloud 106 may be implemented within respective computing devices similar to the computing device 102. The cloud 106 may be communicatively coupled to one or more data storage system, such as an RDBMS.

The speech-to-text API 150 is generally configured to receive audio speech data (e.g., WAV or other sound files) from the computing device 102 and to analyze the audio speech data to generate a textual output corresponding to the speech data. For example, audio speech data may be received by the speech-to-text API 150 that includes a recording of an account manager utterance (e.g., the account manager may utter, “Swipe me in.”). The speech-to-text API 150 may analyze the account manager utterance to identify a textual output corresponding to the utterance. The textual output may be associated with an identifier of the audio speech data (e.g., a filename or a file path). The textual output may include a string of characters corresponding to the utterance (e.g., “Swipe me in.”).

In an embodiment, the speech-to-text API 150 is a custom speech model that is configured to improve the speech-to-text analysis. The custom speech model may allow for the speech-to-text API 150 to filter out background noise and/or to analyze audio speech data that includes speech accents. For example, the speech-to-text API 150 may be an acoustic model that is configured to handle background noise, such as the noise generated by an office environment or other workplace. The speech-to-text API 150 may analyze language to determine examples of common phrases, or snippets, wherein grammar is not emphasized. The speech-to-text API 150 may include pronunciation facilities for identifying jargon or phrases specific to a particular employer environment. For example, in an embodiment, the speech-to-text API 150 may be configured to correctly identify product-specific terms such as an Electronic Data Code (“EDC”), or another product identifier, and role-specific terms such as Account Manager (“AM”). The speech-to-text API 150 may be configured to identify both the terms “EDC” and “AM” as well as instances of such terms. For example, the audio speech data may include a recording of a person uttering an instruction to order an item from a particular account manager. The speech-to-text API 150 may generate a textual output: “Order 3 EDC 2154459 for Account 123abc and notify AM Jane Doe”. Here, the speech-to-text API 150 is pre-trained to identify EDC and AM as known words in a lexicon. In some embodiments, other words may be added to the lexicon. The speech-to-text API 150 may be implemented using any suitable technologies. For example, in an embodiment, portions of the speech-to-text API 150 are implemented using the Microsoft Bing Speech-to-Text API. The command interpreter 152 of the cloud 106 may analyze the textual output of the speech-to-text API 150 to identify an intent and/or an entity.

The command interpreter 152 is generally configured to receive textual input (e.g., phrases, strings, sentences, textual utterances, etc.) and to analyze the textual input using one or more trained model to generate one or both of an intent and an entity. The one or more trained model may be a machine learning (ML) model for interpreting commands. The ML model may be trained by providing examples of commands that are analyzed during a training phase, wherein the ML model is trained, or learns, to associate inputs with commands/intents. For example, utterances such as “Swipe me in” or “Swipe in” and other variations may be used to train the ML model. Then, given a textual input of “I'd like to swipe in”, the command interpreter 152 may generate an intent of “SwipeIn” with a corresponding score of “0.98765”, wherein the score represents the model's confidence. The command interpreter 152 may generate an entity of “Person” or “Employee” having a value equal to the identity of the speaker (e.g., the person swiping in). Another application may analyze the output of the command interpreter 152 to perform an action, such as performing a time clock in on behalf of a user.

The command interpreter 152 may use reinforcement learning and/or active learning to periodically update the ML model, so that the command interpreter 152 becomes more accurate over time. For example, an administrator may periodically review utterances that are not understood by the command interpreter 152, and categorize such instances as new training data. The administrator may retrain the model to incorporate the new training data, thereby improving the accuracy of the model. Any suitable technologies may be used to implement the command interpreter 152. For example, in an embodiment, the command interpreter 152 is created using Azure Language Understanding Intelligent Service (LUIS). Because the present techniques use ML techniques, the present techniques are resilient to variations in commands. For example, although the ML model may only be trained using a labeled command of “Set line text to Z”, the ML model may understand similar but distinct commands, such as “I'd like to set the line text to Z.” Such flexibility in interpretation is advantageous over existing assistant systems, which utilize templates and are therefore brittle, and unable to adapt to natural variations in speech patterns.

In some embodiments, the environment 100 may include more than one command interpreter 152 corresponding to multiple remote applications, or remote modules, each accessed individually by an application executing in the computing device 102. For example, a first command interpreter 152 may be a Cart voice interpreter 152. A second command interpreter 152 may be a GeneralNavigation voice interpreter 152. Additional types of command interpreters are envisioned, such as PowerSearch, CoWorker, Context, etc. Each command interpreter 152 may encapsulate and/or correspond to a trained ML model for interpreting utterances related to a specific type to identify an intent.

In an embodiment, the environment 100 provides two-level natural language understanding via multiple command interpreter modules. For example, the utterance may be run against two or more command interpreter 152, wherein each command interpreter 152 provides a respective score indicating the likelihood that the user's intent corresponds to one of the command interpreters 152. By creating multiple command interpreters 152, the developer is advantageously allowed to train subsets of voice commands by type/topic, rather than requiring the developer to train a monolithic ML model. By separating types of intent into multiple command interpreters 152, the developer can also selectively enable/disable the command interpreters 152, effectively disabling/enabling the set of voice functionality available to a user at any given time.
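The following C# sketch illustrates this two-level approach. The ICommandInterpreter interface and dispatcher are hypothetical names used for example only; concrete implementations would be backed by the command interpreter APIs 152.

using System.Linq;
using System.Threading.Tasks;

// Sketch of two-level natural language understanding: the utterance is first scored
// by each enabled command interpreter (e.g., Cart, GeneralNavigation), and the
// highest-scoring interpreter then resolves the concrete intent and entities.
public interface ICommandInterpreter
{
    string Name { get; }
    bool Enabled { get; set; }                          // selectively enable/disable a topic
    Task<double> ScoreAsync(string utteranceText);      // likelihood the utterance belongs to this topic
    Task<string> InterpretAsync(string utteranceText);  // returns the structured intent/entity response
}

public class TwoLevelDispatcher
{
    private readonly ICommandInterpreter[] _interpreters;

    public TwoLevelDispatcher(params ICommandInterpreter[] interpreters) => _interpreters = interpreters;

    public async Task<string> InterpretAsync(string utteranceText)
    {
        // First level: score the utterance against every enabled interpreter.
        var scored = await Task.WhenAll(
            _interpreters.Where(i => i.Enabled)
                         .Select(async i => (Interpreter: i, Score: await i.ScoreAsync(utteranceText))));

        // Second level: the best-scoring interpreter decodes the intent and entities.
        var best = scored.OrderByDescending(s => s.Score).First();
        return await best.Interpreter.InterpretAsync(utteranceText);
    }
}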

In operation, a user may create (e.g., by accessing the cloud 106 in a web browser) a remote application in the cloud 106 for grouping one or more intents. Intents may correspond to user imperatives/commands, and may be created by developers and/or users of the environment 100. Intents may correspond to different user imperatives. For example, a user may create a set of page navigation intents: GoToCoworkerByID, GoToCoworkerByName, GoToManagerByName. A user may create a set of intents for performing day work functionality: SwipeIn and SwipeOut. A user may create intents for retrieving information from a worker's personnel records or an organizational chart: AccruedSickTime, AccruedVacation, AnyMeetingToday, Birthday, etc. The user may associate one or more labeled utterances with each intent.

For example, for an intent, “GoToCoworkerByName”, the user may associate the following utterances, which may include entities, as denoted using capitalization: “show me information for coworker Pronoun”; “go to coworker joe smiths page”; “show me information for jane doe”; “show me info for PersonName”; “show me information for coworker PersonName”; “show info page”; “show coworker info page for Pronoun”; “show me coworker information for jane doe”; “coworker is PersonName”; “show me information for coworker PersonName”. The user may, for each of the foregoing utterances, associate a labeled intent with the utterance. The ML model may then be trained using the labeled utterances, and ambiguous utterances may be labeled to reinforce the model. A list of ambiguous utterances may be displayed, along with a respective label. The user may review the ambiguous utterances, label them, and retrain the ML model, which improves accuracy.

Voice Control of Enterprise Applications

The operation of environment 100 will now be described with respect to FIG. 2. As seen, FIG. 2 depicts a conceptual model 200 for voice control of enterprise systems and applications, including a cloud services layer 202 and an application 206 that is part of an application services layer, corresponding respectively to the cloud 106 and the computing device 102 of FIG. 1. FIG. 2 also depicts a voice control hub 204 that may act as a bridge or proxy between the cloud services layer 202 and the application 206. The voice control hub 204 may include one or more shared objects 210, an audio management module 212, a lifecycle management module 214, a global entity directory 216, a command handling module 218, a channel subscription manager 220, and a wake management module 222.

The one or more shared objects 210 may be computer-executable libraries that are shared by executable files. A user implementing a voice-enabled application, such as the application 206, may link the application 206 to one or more of the shared objects 210. The shared objects 210 may, in turn, enable the application 206 to access certain functionality provided by the voice control hub 204 via a language-level API. In particular, the voice control hub 204 may be encapsulated as a single shared object library (e.g., a .DLL or .so) that can be deployed to provide turnkey voice capabilities to an existing application. In some embodiments, the voice control hub 204 may be implemented using a programming language such as Java, C-Sharp, etc. The shared library may include one or more modules of the computing device 102, such as the speech-to-text module 116, the intent module 118, the framework module 120, etc. and/or modules of the voice control hub 204.

The audio management module 212 includes events and properties. The events may include a set of events, or state transitions, that are possible in a voice-enabled application. For example, the audio management module 212 may include an OnAudioPlaybackSuccess event, an OnAudioPlaybackDeviceError event, an OnRecordingDeviceError event, and an OnMicrophoneStatusChanged event. The voice control hub 204 may capture these events, and others, in response to operating system-level events. The voice control hub 204 may allow a user implementing the voice control hub 204 to register one or more callback function with respect to each event. For example, the user may register a callback function to display a message to the user on a screen (e.g., the output device 142 of FIG. 1) when a microphone changes status. The properties of the audio management module 212 may allow the user to retrieve information regarding audio interfaces. For example, the properties may include the current capture device, the current playback device, whether a default recording device is attached, whether an audio initialization failure has occurred, etc. In some embodiments, audio device management is automatic and handled by the operating system, and an application implementing the voice control hub 204 need not explicitly manage hardware. However, the user (e.g., a developer) may find that the ability to access events and properties provides advantageous visibility into audio interfaces that is lacking in other systems.

The lifecycle management module 214 of the voice control hub 204 initializes the voice control hub 204 and provides start and stop functionality. Specifically, the lifecycle management module 214 provides an initialize function, an activate function, and a deactivate function that an application implementing the voice control hub 204 (e.g., the application 206) may call to, respectively, initialize, activate, and deactivate a voice interface.
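A minimal usage sketch follows. The VoiceControlHub class shown is a placeholder for the hub described herein, and the method bodies are omitted; the application class and its method names are assumptions.

// Placeholder for the voice control hub 204; the real implementation is not shown here.
public class VoiceControlHub
{
    public void Initialize() { /* set up audio devices, cloud credentials, handlers */ }
    public void Activate()   { /* start listening for the wake phrase or key combo  */ }
    public void Deactivate() { /* stop the voice interface                          */ }
}

// Hypothetical application (e.g., the application 206) driving the hub lifecycle.
public class OrderEntryApp
{
    private readonly VoiceControlHub _hub = new VoiceControlHub();

    public void OnStartup()
    {
        _hub.Initialize();   // one-time initialization of the voice interface
        _hub.Activate();     // begin accepting voice commands
    }

    public void OnShutdown() => _hub.Deactivate();   // stop the voice interface on exit
}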

The global entity directory 216 of the voice control hub 204 may be used by the application 206 to influence how the voice control hub 204 decodes an entity (e.g., the OrderNumber entity discussed above) by setting values in a global context visible to a handler object. For example, the application 206 may set a current order number from a global context of the application 206, and the current order number will be visible to an event handler (e.g., the event handler 122). Therefore, the user can programmatically override the entity, which may be otherwise determined via a voice command.
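A sketch of such a directory follows; the API shape and the order-number value are illustrative assumptions.

using System.Collections.Concurrent;

// Sketch of the global entity directory (216): the application publishes the current
// order number so an intent handler can resolve it without a spoken value.
public static class GlobalEntityDirectory
{
    private static readonly ConcurrentDictionary<string, string> _values = new();

    public static void Set(string entityName, string value) => _values[entityName] = value;

    public static string Get(string entityName)
        => _values.TryGetValue(entityName, out var v) ? v : null;
}

// In the application, when the user opens an order screen:
//   GlobalEntityDirectory.Set("OrderNumber", "SO-100245");
// In a handler, when no OrderNumber entity was spoken:
//   string orderNumber = GlobalEntityDirectory.Get("OrderNumber");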

The command handling module 218 of the voice control hub 204 responds to speech audio data (e.g., audio received as input to the input device 140) and translates that audio data into standardized intents/entities. In an embodiment, the command handling module 218 may transmit the voice audio data via a network (e.g., the network 104) to a speech-to-text API in the cloud services layer 202 (e.g., the speech-to-text API 150 of FIG. 1), and receive a response including a textual string corresponding to the speech audio data. The command handling module 218 may then transmit the textual string to a command interpreter API in the cloud services layer 202 (e.g., the command interpreter API 152 of FIG. 1) and retrieve/receive a structured data response including one or more intent and one or more entity, as discussed above. In some embodiments, the speech-to-text API and the command interpreter API may be queried in a single network call to the cloud services layer 202. For example, the environment 100 may include a second computing device 102 to which the audio data is transmitted, wherein the second computing device 102 calls the APIs in the cloud services layer 202 in parallel, collates the responses, and returns a single response to the computing device 102 via the network 104.

In some embodiments, the command handling module 218 may include a set of instructions for implementing multi-turn communication with the user. For example, the user may utter a “go to power search” command. The user's command may be translated by the command interpreter 152 as an intent (e.g., PageNavigate) and an entity (e.g., PowerSearch). The intent and/or the entity may be included in a predetermined list of multi-turn intents/entities. The command handling module 218 may compare the identified entity (e.g., PowerSearch) to the predetermined list of multi-turn entities. When the identified entity is a multi-turn entity, the command handling module 218 may enter a multi-turn mode, wherein the user is repeatedly prompted to provide information until a condition is met. For example, the predetermined list may be a hash table storing a list of required attributes in association with the identified entity, such that, for example, PowerSearch is associated with fields Name, Date, and Description. The command handling module 218 may successively prompt the user to provide, via voice, a name, a date, and a description. A special command handling function may be created to collect required attributes when the command handling module 218 detects a multi-turn entity. The predetermined list may be stored, for example, in a database such as the data store 130.

In some embodiments, the application 206 may establish a context for an intent handler to facilitate multi-turn commands. The context may be established in a first multi-turn handler, and then made available to successive multi-turn handlers, so that, for example, a second and third multi-turn handler may access the context to determine which information, if any, is still required of the user.
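The sketch below illustrates one way such a collector might work, assuming a hash table of required attributes and a prompt-and-listen callback; all names are illustrative and not prescribed by this disclosure.

using System.Collections.Generic;

// Sketch of multi-turn handling (command handling module 218): when a multi-turn
// entity such as PowerSearch is detected, prompt for each still-missing attribute.
public class MultiTurnCollector
{
    // Predetermined list: entity -> required attributes (e.g., PowerSearch needs Name, Date, Description).
    private static readonly Dictionary<string, string[]> RequiredAttributes = new()
    {
        ["PowerSearch"] = new[] { "Name", "Date", "Description" }
    };

    public bool IsMultiTurn(string entity) => RequiredAttributes.ContainsKey(entity);

    // Context established by the first multi-turn handler and shared with later ones.
    public Dictionary<string, string> Collect(string entity,
                                              Dictionary<string, string> context,
                                              System.Func<string, string> promptAndListen)
    {
        foreach (string attribute in RequiredAttributes[entity])
        {
            if (!context.ContainsKey(attribute))                     // still required of the user?
                context[attribute] = promptAndListen($"Please provide the {attribute}.");
        }
        return context;
    }
}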

It should be appreciated that the command handling module 218 may include instructions for role-based permission and security. Authentication information of a user may be passed by an application to the voice control hub 204 each time the user issues a voice command. The command handling module 218 may analyze the authentication information of the user to determine the permission level of the user with respect to data access and command execution. For example, a human resources user may have permissions enabling the user to access personnel records, whereas a sales user may not have such permissions. When the sales user requests salary information of another user, the command handling module 218 may first determine whether the sales user is authorized to access salary information by consulting a database (e.g., the data store 130). If the user is not authorized, the voice control hub 204 may return an error. In some embodiments, initializing the voice control hub 204 may include passing a path to an LDAP resource containing permission information for the users of an organization. Therefore, an advantage of the voice control hub 204 is compatibility with existing security and permission schemes. In some embodiments, the command handling module may blacklist some terms (e.g., a vulgar or an offensive word). For example, the speech-to-text module 116 of FIG. 1 may cross-reference a string of text received from a remote speech-to-text API against a dictionary of blacklisted terms. The intent module 118 of FIG. 1 may perform similar checks, wherein the checks include comparing the access level of the user issuing the command to a list of allowed entities/intents. In an embodiment, the command handling module 218 may log all queries in association with the identity of the authenticated user. For example, the command handling module 218 may store a row in the data store 130 for each query issued by the user, along with the ID of the user, and a timestamp. In this way, an audit trail of queries is created for debugging and security purposes.
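The following is a minimal sketch of such a check-and-log step. The authorization delegate, blacklist, and audit callback are placeholders standing in for the LDAP lookup and the data store 130 described above.

using System;
using System.Collections.Generic;

// Sketch of the role-based check and audit trail performed before dispatching a command.
public class CommandSecurity
{
    private readonly HashSet<string> _blacklistedTerms = new() { /* vulgar or offensive words */ };
    private readonly Func<string, string, bool> _isAuthorized;   // (userId, intent) -> allowed?
    private readonly Action<string, string, DateTime> _auditLog; // (userId, query, timestamp) -> data store 130

    public CommandSecurity(Func<string, string, bool> isAuthorized,
                           Action<string, string, DateTime> auditLog)
    {
        _isAuthorized = isAuthorized;
        _auditLog = auditLog;
    }

    public bool CheckAndLog(string userId, string intent, string queryText)
    {
        _auditLog(userId, queryText, DateTime.UtcNow);          // audit every query

        foreach (string term in _blacklistedTerms)
            if (queryText.Contains(term, StringComparison.OrdinalIgnoreCase))
                return false;                                   // reject blacklisted terms

        return _isAuthorized(userId, intent);                   // e.g., HR-only intents
    }
}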

The channel subscription manager 220 of the voice control hub 204 allows objects inside the application 206 to hook into the voice control hub. From the perspective of an application implementing the voice control hub 204, such as the application 206, the logic and workings of the command handling module 218 are hidden. The application 206 need only register voice handlers that are triggered when a particular intent/entity is received by the voice control hub 204. As discussed below, any application object, of any kind (e.g., a GUI, a view model, a business object, etc.) may register handlers and subscribe to a particular channel.

A channel refers to a logical separation between one or more topics (e.g., one or more intents). For example, a user may use instructions provided by the channel subscription manager 220 to subscribe a first application to an action channel for all Action intents. Therefore, any action received by the voice control hub 204 would be routed only to the action channel, and to the first application. Of course, additional applications could subscribe to the same channel, and receive a duplicate notification when the action intent is received in the voice control hub 204. The user could use the instructions to subscribe a second application to a query channel, such that all Query intents would be routed only to the query channel, and the second application (in addition to other applications subscribed to the query channel). In some embodiments, multiple channels may be used to subscribe to intents of the same type (e.g., a first channel subscription to a SwipeIn intent and a second channel subscription to a SwipeOut intent). The channel subscription manager 220 provides the user with the ability to subscribe to intents and/or entities at any level of granularity. The user may subscribe an application to one or more channel by, for example, entering a subscription instruction via an input device. The subscription instruction may be part of the application's code, or a separate module.
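A sketch of such channel-based routing follows; the manager API and channel names are hypothetical and shown only to illustrate the subscribe/publish behavior described above.

using System;
using System.Collections.Generic;

// Sketch of the channel subscription manager (220): application objects subscribe
// to logical channels (e.g., "Action", "Query") and the hub routes each decoded
// intent only to the subscribers of the matching channel.
public class ChannelSubscriptionManager
{
    private readonly Dictionary<string, List<Action<string, string>>> _subscribers = new();

    // Subscribe a handler to a channel; the handler receives (intentName, entityValue).
    public void Subscribe(string channel, Action<string, string> handler)
    {
        if (!_subscribers.TryGetValue(channel, out var list))
            _subscribers[channel] = list = new List<Action<string, string>>();
        list.Add(handler);
    }

    // Route an intent to every subscriber of its channel (e.g., "Action" for Action_SwipeIn).
    public void Publish(string channel, string intentName, string entityValue)
    {
        if (_subscribers.TryGetValue(channel, out var list))
            foreach (var handler in list)
                handler(intentName, entityValue);
    }
}

// Usage: subscribe an order screen to the Action channel.
//   manager.Subscribe("Action", (intent, entity) => { /* e.g., handle Action_SwipeIn */ });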

The wake management module 222 of the voice control hub 204 allows the voice control hub to be “woken up.” The user may awaken the voice control hub 204 via a wake phrase received via a microphone, via a keystroke via a keyboard, etc. For example, the awakening input may be provided to the input device 140 of FIG. 1, and may include any suitable input such as a shaking of a device having an accelerometer, one or more keystrokes, a spoken keyword, etc. The user may set the wake phrase and/or wake key combo via the properties of the wake management module 222. The user may also set the wake mode by choosing between the wake phrase, the key combination, or both. An OnWakeModeChanged event may be triggered any time the wake mode is changed, and the application 206 may detect such events and take action based on such detection (e.g., by displaying a message on an output device).
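A sketch of configuring the wake management module follows. The property and event names track the description above; their exact signatures, and the placeholder phrase and key combination, are assumptions.

// Sketch of the wake management module (222) configuration surface.
public enum WakeMode { WakePhrase, KeyCombination, Both }

public class WakeManagement
{
    public string WakePhrase { get; set; } = "hey assistant";   // placeholder phrase
    public string WakeKeyCombo { get; set; } = "Ctrl+Shift+V";  // placeholder key combination

    public event System.Action<WakeMode> OnWakeModeChanged;

    private WakeMode _mode = WakeMode.WakePhrase;
    public WakeMode Mode
    {
        get => _mode;
        set { _mode = value; OnWakeModeChanged?.Invoke(value); } // notify the application 206
    }
}

// Usage: the application 206 may display a message whenever the wake mode changes.
//   wake.OnWakeModeChanged += mode => ShowMessage($"Wake mode is now {mode}");
//   wake.Mode = WakeMode.Both;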

In some embodiments, the wake management module 222 may process a wake word and following spoken words together. For example, a user may speak a phrase such as, “Who is Joe Smith's manager?”. The wake management module 222 may continuously monitor all audio voice data until the wake management module 222 encounters the wake word, and the wake management module 222 may capture all words following the wake word as a query. To monitor a wake word contained in audio voice data, the wake management module 222 may transmit the spoken words received from an input device to a speech-to-text API, such as the speech-to-text API 150 of FIG. 1. In some embodiments, monitoring the wake word may include the wake management module 222 analyzing words locally (e.g., in the audio management module 212 of FIG. 2). If the wake mode is configured for a key press wake indication, then the wake management module 222 may listen for a key press event (e.g., a key press emitted by the input device 140 of FIG. 1). By relocating the wake word monitoring to a local machine (e.g., a mobile computing device), the present techniques may advantageously eliminate delay and latency.

In some embodiments, the application 206 may include a timeout period (e.g., a number of seconds) which keeps the voice control hub 204 active. While the user is using the application 206 during the timeout period, no wake word may be required, because the application is continuously active. Therefore, the user may issue voice commands to the application 206 without any need to use a wake word. The always-on capabilities of the present techniques represent an advantage over existing commercial systems which require each command to be prefixed by a particular keyword or phrase.

It should be appreciated that the techniques described herein do not require a wake word or wake up mode. For example, in an embodiment, the voice control hub 204 may continuously listen to the user's speech, and process the audio speech data of the user as a stream of input with an indefinite termination (e.g., no wake/asleep status). Such a mode may be used, for example, to provide a dictation service.

The application 206 may include any number of components, such as one or more view model (e.g., view model 230 and view model 232) in addition to business objects, GUIs, etc. In general, the application 206 is configured to allow a developer to enable voice functionality for the components, as discussed further below.

It should be appreciated that some embodiments may include multiple instances of the application 206. For example, an account manager user may have a first application 206 open on the user's screen throughout the day as the user services the accounts of customers of the enterprise. The first application 206 may be an email application, from which the user reads email messages from customers throughout the day. The account manager user may read an email from a customer requesting that the account manager user create an order for an item. The account manager user may utter, “Create an order from this email.” The utterance may be processed as described above, and ultimately, a command handler may create a draft order in a second application 206, filling in context information retrieved from the first application 206. Useful interactions between many different types of applications are envisioned.

Voice-Driven Applications

The operation of environment 100 will now be described with respect to FIG. 3. As noted above, an application (e.g., the application 206 of FIG. 2) may implement the interfaces/modules of a voice control hub to implement voice-driven functionality, and application objects of any kind may enable voice-driven functionality. FIG. 3 depicts a schematic class diagram 300 illustrating the different aspects of voice control an application may achieve. The class diagram 300 includes a voice control hub class 302 having a wake mode property 304. The voice control hub class 302 may correspond to the voice control hub 204 of FIG. 2, and may be the primary interface into the voice control hub. The voice control hub class 302 may incorporate wake handling, voice channel subscription, and voice request handling logic. The wake mode property 304 may correspond to the wake mode property of the wake management module 222 of FIG. 2. The voice control hub class 302 may include an intent handler class 306, a voice transcription class 308, and a speech synthesis class 310 which, respectively, provide details for a voice command including raw transcription details and allow the intent handler to generate speech-based responses to a user.

The intent handler class 306 receives notification when a voice command is received/retrieved (e.g., from the cloud 106 of FIG. 1 or the cloud services layer 202 of FIG. 2). The intent handler class 306 may receive notification from the speech-to-text module 116 and/or the intent module 118 of FIG. 1, for example. The class diagram 300 includes a voice intent class 312 and a voice intent entity class 314 that interoperate with the intent handler class 306. The voice intent class 312 handles a voice command that has been decoded into one or more intents and one or more entities. The voice intent class 312 provides the decoded intents and/or entities in a form that a voice handler (e.g., another set of executable instructions) may operate on directly, avoiding the voice handler needing to operate on audio speech data directly. The voice intent entity class 314 represents the one or more entities that were identified from spoken text.

Handler Registration

The application may register the voice intent handler class 306 at any point in the application's instructions (e.g., in a constructor, during initialization, etc.). The application first registers/obtains the voice control hub class 302, and then registers a handler object (e.g., a function, a class, a message queue, etc.) that will respond to a voice command. In an embodiment, a form of registration is provided that allows a handler to be automatically unregistered when the handler goes out of scope of the application, which may be advantageous when the lifecycle of the handler is complicated or non-deterministic.

In an embodiment, an intent handler may be a class (e.g., VoiceIntentHandler) that implements a function (e.g., HandleVoiceIntent) that receives one or more argument (e.g., a HandleVoiceIntentArgs argument) that offers additional information and/or functionality. The function may return a boolean value to indicate an execution state of the function. The intent handler may implement the VoiceIntentHandler class 306 of FIG. 3, for example. The argument may expose functions, such as a speech synthesis function. It should be appreciated that the function and class names are provided for exemplary purposes. The intent handler can be named differently, in some scenarios. The intent handler may implement the VoiceIntentHandler class 306, which may only require that a single specific function be implemented, wherein that single specific function receives the handler argument/arguments.
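For illustration, a sketch of such a handler follows. Only the class and function naming follows the description above; the members of HandleVoiceIntentArgs (intent name, entities, speech synthesis delegate) and the swipe-in bookkeeping are assumptions.

// Hypothetical argument type exposing the decoded intent, entities, and a speech synthesis function.
public class HandleVoiceIntentArgs
{
    public string IntentName { get; set; }                              // e.g., "Action_SwipeIn"
    public System.Collections.Generic.Dictionary<string, string> Entities { get; set; }
    public System.Action<string> Speak { get; set; }                    // speech synthesis function
}

public interface IVoiceIntentHandler
{
    bool HandleVoiceIntent(HandleVoiceIntentArgs args);                 // returns execution state
}

// Example handler responding to a swipe-in command, as discussed below.
public class SwipeInHandler : IVoiceIntentHandler
{
    private bool _alreadySwipedIn;

    public bool HandleVoiceIntent(HandleVoiceIntentArgs args)
    {
        if (args.IntentName != "Action_SwipeIn")
            return false;                                               // not handled here

        if (_alreadySwipedIn)
        {
            args.Speak("You are already swiped in.");
        }
        else
        {
            _alreadySwipedIn = true;                                     // record the time clock punch
            args.Speak("You have been swiped in.");
        }
        return true;                                                     // handled
    }
}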

The function may perform many tasks related to event handling on behalf of a user's spoken commands. For example, in an embodiment, the function may pass a string argument to the speech synthesis function to cause the speech synthesis function to speak to the user via an output device (e.g., the output device 142 of FIG. 1). In some embodiments, the output device may be located in a mobile computing device of the user that is a part of the environment 100 and which is linked to the computing device 102 via the network 104. The speech synthesis capabilities of the speech synthesis function may be provided by, for example, the speech synthesis class 310.

In another embodiment, the function may examine the intent name associated with the argument. Specifically, the argument corresponds to the intent included in the user's voice command. When the intent name matches a particular intent type (e.g., SwipeIn), the event handler function may then use speech synthesis to notify the user via the output device (e.g., via an audio speaker) that “You have been swiped in.” For example, the speech synthesis class 310 may cause a .WAV file to be played back via the audio speaker. When the user is already swiped in, the function may inform the user via speech synthesis or another method (e.g., a display message on an output display) that, “You are already swiped in.” The function may also access any entities included in the user's voice command and perform any desirable analysis (e.g., checking a unit price or dollar value of an OrderItem entity).

In an embodiment, a presenter or view model (e.g., view model 230 of FIG. 2) may implement the voice intent handler class 306, in addition to a view model class (e.g., a Windows Presentation Framework View Model). The view model class may implement a Model-View-ViewModel (MVVM) GUI design pattern. Then, the function handling the voice intent may manipulate the ViewModel appropriately by determining the intent name received in the argument. For example, the handler function may set a selected filter pricing list GUI component according to whether the entity corresponds to a landed cost, MSRP, advertised price, or adjusted simulated price. Of course, many more GUI view model adjustments are envisioned. As noted, in some embodiments, the handler function may navigate to a web page within a web site when a particular intent is received. For example, a command such as “Go to Power Search” may be interpreted as a NavigationPowerSearch intent. When the handler receives the intent, the handler may manipulate the ViewModel to display a power search page. It should be appreciated that for programs that include the MVVM pattern, the GUI layer can be changed (e.g., by a backend code update) without disturbing the functionality and capability of the voice-based features. For example, if a Windows Presentation Framework (WPF) user element is modified, voice integration will still function correctly, because the voice functionality is not tied to any particular GUI code or markup language.
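A sketch of such a view-model handler follows, reusing the hypothetical handler types from the previous sketch; the property names, intent names, and entity keys are illustrative assumptions rather than required elements.

using System.ComponentModel;

// Sketch of a view model (e.g., 230) that also acts as an intent handler in an MVVM
// design: the handler mutates view-model state, and the bound GUI follows automatically.
public class SearchViewModel : INotifyPropertyChanged, IVoiceIntentHandler
{
    public event PropertyChangedEventHandler PropertyChanged;

    private string _currentPage = "Home";
    public string CurrentPage
    {
        get => _currentPage;
        set
        {
            _currentPage = value;
            PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(nameof(CurrentPage)));
        }
    }

    public string SelectedPricingFilter { get; private set; } = "MSRP";

    public bool HandleVoiceIntent(HandleVoiceIntentArgs args)
    {
        switch (args.IntentName)
        {
            case "NavigationPowerSearch":                 // "Go to Power Search"
                CurrentPage = "PowerSearch";
                return true;
            case "SetPricingFilter":                      // e.g., landed cost, MSRP, advertised price
                SelectedPricingFilter = args.Entities["PricingList"];
                return true;
            default:
                return false;
        }
    }
}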

As noted above, the function may retrieve information related to an individual's personnel records or an organizational chart. For example, the handler may retrieve a vacation balance, a telephone extension, a manager of a person (e.g., an entity), etc. The handler may use the manager intent and person name entity as a parameter, wherein the personnel database is queried for the manager of the person (e.g., GetManager(personName=joe)). Other parameterized queries are envisioned.

Intent Processing Examples

As noted above, the user may create a set of page navigation intents. The navigation may be according to a page (e.g., Power Search). The user may utter a command, such as “Go to current order of customer.” User utterances include, inter alia, order creation (“Create new order for customer Bob Smith.”), customer location (“Find a customer matching name”), e-commerce cart creation (“Create a cart”), e-commerce cart use (“Add item EDC number 123456 quantity 2 to cart”), manipulation of line-oriented graphical user interfaces (“Add a new line”, “Set line text”, etc.), setting a field in an order (“Set note: customer wants ASAP”), order retrieval (“Get current orders”), setting search criteria and querying (“Find keyboard products up to price $200”), inventory lookups (“Find in stock items”), etc. The user may submit forms and perform other GUI actions (“Begin search”) using search parameters entered into one or more fields of the GUI by voice. The user may update customer and company representative information (“Change contact to John, email john@example.com. Change account manager to Jeff.”). The user may place an order once the order is complete (“Place order”).

It should be understood that allowing the user to navigate business processes and applications via voice is in many cases much faster than traditional non-voice user interfaces (e.g., via mouse and keyboard). Using the present techniques, the user may not need to explicitly open any screens or windows within an application. Rather, the user merely speaks the user's intent and the application, which understands the company's business context, performs an appropriate action.

Convolutional Neural Network for Identifying Fields

Manipulation of GUIs using the voice-based features discussed herein may require a component (e.g., the voice intent handler class 306) to have a priori knowledge of an application. In particular, the identifier of a particular GUI widget or component may be required in order for the handler function to reference and update the widget or component. For example, in an HTML setting, a GUI element has a class and/or an ID that can be used to unambiguously obtain a reference to the GUI element. Cascading Style Sheet (CSS) selectors, XPath, etc. are examples of explicitly referencing elements via a unique address.

However, the voice control hub 204 is intended to allow developers and end users to naturally refer to elements of a GUI without coding explicit references, or speaking specific references. For example, a user may state, “update the text box: user wants priority shipping”. Requiring the user to know the unique identifier for the notes field would be cumbersome and unworkable, at least because such identifiers are often intended to be machine-readable (e.g., a UUID such as 5fda362e-5c76-4867-a9d7-4384b69dc582) and not intended to be user-friendly. While internal control names such as UUIDs may be used in some embodiments, the present techniques include more natural ways for users to reference GUI elements.

Further context will now be provided with reference to FIG. 4. FIG. 4 depicts an example GUI 400. The GUI 400 allows a user (e.g., an account manager) to create an order on behalf of a customer. As noted above, the user can navigate and set the values of a plurality of fields 404. For example, the user may state, “update quantity to 10”. The user's utterance may be recorded as speech audio data (e.g., a WAV file) by the input device 140 and received by the framework 120 of the computing device 102 of FIG. 1. The framework may delegate the audio data to the speech-to-text module 116, which may transmit the audio data to the speech-to-text API 150. The framework may receive a response string from the speech-to-text API 150 and submit the string to the command interpreter 152. The intent handler 118 may then receive a structured data set denoting an intent (e.g., “Update”), an associated entity (e.g., Quantity), and an entity parameter (e.g., 10). The framework 120, and/or the command handling module 218 of the voice control hub 204 of FIG. 2, may dispatch the intent, entity, and entity parameter to a handler function. The handler function may be one that is subscribed to quantity events. For example, the user may have previously registered a handler for intents/entities of a particular type (e.g., quantity entities) via the channel subscription manager module 220 of FIG. 2. Therefore, the quantity entity is sent to a particular quantity handling function, as discussed with respect to the voice intent handler class 306 of FIG. 3.

In some embodiments, the handling function may include a reference to the field in the plurality of fields 404 (e.g., the quantity). In other embodiments, the handling function may access a pre-determined mapping of entities to fields to obtain a reference to the appropriate GUI component.
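One way to realize such a pre-determined mapping is a simple dictionary keyed by entity name, as in the sketch below. The entity names and control identifiers are hypothetical placeholders, not names from the GUI 400.

using System;
using System.Collections.Generic;

// Sketch: a pre-determined mapping from entity names to GUI control identifiers.
public static class EntityFieldMap
{
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Quantity", "OrderForm.QuantityInput" },
            { "Customer", "OrderForm.CustomerDropdown" },
            { "Notes",    "OrderForm.NotesTextBox" },
        };

    // Returns the control identifier for an entity name, or null if unmapped.
    public static string Resolve(string entityName) =>
        Map.TryGetValue(entityName, out var control) ? control : null;
}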

In other embodiments, the quantity handling function may call a trained ML model, passing the entity as a parameter, wherein the trained ML model analyzes the entity parameter, and the GUI 400, to generate a reference to the appropriate field.

The trained ML model may be a CNN that is trained to analyze labeled digital images, wherein the labeled digital images depict GUI elements with associated component labels. Using a CNN to identify which field is associated with a label allows the mapping to use screen text rather than internal control names. For example, as in FIG. 4, a notes field component has a component label 406 and an input field 408. The CNN may be trained using a set of tuples, wherein the first element is a description of the element (e.g., “text box”) and the second element is an image of the input field 408. Other component types and labels may be included in the training data set. The CNN may be trained to identify GUI controls based on a description of those controls.

For example, FIG. 5 depicts a CNN 500, including an image 502-1. The image 502-1 may be represented using a three-dimensional tensor including the raw pixels of the image 502-1 and three color channels (red, green, and blue). The image 502-1 may be analyzed by a first convolution operation 504-1, which scans the image 502-1 using a filter to produce a stack of activation maps 502-2, wherein one activation map is created for each filter. The filter may downsample or otherwise manipulate the image 502-1. A second convolution operation 504-2 may produce a second activation map 502-3, a third convolution operation 504-3 may generate a third feature map 502-4, and a fourth convolution operation 504-4 may generate a fourth feature map 502-5. Any suitable number of convolution operations may be used. Finally, a fully-connected layer 504-5 may generate an output 506-1. Depending on the training modality, the output 506-1 may be a pixel location, a reference to the GUI object, etc.

The CNN 500 may be trained to identify particular components (e.g., the text box 408). In an embodiment, the output 506-1 may be a pixel height and pixel width representing the bottom left corner of the desired HTML element. In some embodiments, the output 506-1 may be a reference to an object in memory (e.g., a pointer address to a GUI element). In another embodiment, the output 506-1 may represent coordinates for a bounding box around the desired element. In some embodiments, the CNN 500 may also identify GUI controls based on nearby textual labels, in addition to the control type. For example, the CNN 500 may identify a GUI control based on a description of its data (e.g., “the quantity field”) or its type (e.g., “the input field”). The application using the CNN for identification may employ multi-turn communication, as discussed herein, to disambiguate user queries (e.g., in an application having multiple input fields).

Once the CNN 500 is trained, a screen capture (e.g., a screen shot) of the GUI 400 may be passed to the CNN 500, along with a description of a sought element (e.g., “text box”). The CNN 500 then outputs the location of the desired element with respect to the screen capture. An application (e.g., the application 206 of FIG. 2) may then use the output 506-1 to locate the desired element in the GUI 400. It should be appreciated that the screen shot may be of any application (e.g., a PDF file, an image, a web page, etc.).

Therefore, continuing the above examples, when the user states “update the text box: user wants priority shipping”, the application may call the CNN, passing an entity (e.g., “text box”) and a screen capture of the current GUI screen as parameters. The CNN convolves the screen capture and the entity input, returning an identifier (e.g., a path, a reference, a pixel location, etc.) to the application. The application may then use the identifier to manipulate the identified GUI control by changing the control, editing its content (e.g., “user wants priority shipping”), etc. As above, the application may analyze the intent to determine an operation to perform (e.g., an update, a delete, an addition, etc.).
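The following sketch shows how an application might invoke such a model and apply the result. The IFieldLocator interface, its LocateAsync signature, and the GuiAutomation helper are hypothetical stand-ins for the CNN 500 and the GUI manipulation described above; they are not the disclosed implementation.

using System.Threading.Tasks;

// Sketch: locate a GUI control from a screen capture plus a spoken description,
// then edit it. IFieldLocator and GuiAutomation are illustrative abstractions.
public interface IFieldLocator
{
    // Returns a bounding box (x, y, width, height) for the described element.
    Task<(int X, int Y, int Width, int Height)> LocateAsync(byte[] screenCapture, string description);
}

public class TextBoxUpdateHandler
{
    private readonly IFieldLocator _locator;
    public TextBoxUpdateHandler(IFieldLocator locator) => _locator = locator;

    public async Task<bool> UpdateAsync(byte[] screenCapture, string description, string newText)
    {
        var box = await _locator.LocateAsync(screenCapture, description); // e.g., "text box"
        GuiAutomation.ClickAt(box.X + box.Width / 2, box.Y + box.Height / 2);
        GuiAutomation.TypeText(newText);                                  // e.g., "user wants priority shipping"
        return true;
    }
}

// Hypothetical automation helper; a real implementation might wrap an RPA tool.
public static class GuiAutomation
{
    public static void ClickAt(int x, int y) { /* simulate a mouse click */ }
    public static void TypeText(string text) { /* simulate keyboard input */ }
}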

Fuzzy Matching

Sometimes, a user's utterance may include an entity name that is not exact, even if an exact name is included on a GUI. For example, the Customer dropdown box in the plurality of fields 404 may include the name Joseph Jones. If a user's voice command includes, “set customer to joe jones”, the intent may be an UpdateCustomer intent and the entity may be “Joe Jones”. An intent handler for performing an update of the customer dropdown element may include a set of instructions for determining whether the named entity (e.g., Joe Jones) is found in the GUI control and, if not, for performing a fuzzy match. The fuzzy match may include using a string distance function, a longest common subsequence algorithm, the Levenshtein distance algorithm, a Soundex or Soundex-like function, etc. The fuzzy match may include analyzing a custom element such as the starting initials of words, and/or permutations of word ordering. Therefore, names, products, etc. need not be exactly uttered by a user, and yet the handling function will still correctly carry out the user's intent.
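A minimal sketch of one of the listed techniques, Levenshtein distance, used to pick the closest dropdown entry. The candidate list is illustrative.

using System;
using System.Linq;

// Sketch: fuzzy match a spoken name against dropdown entries using Levenshtein
// distance; one of several techniques the handler might use.
public static class FuzzyMatcher
{
    public static string BestMatch(string spoken, string[] candidates) =>
        candidates.OrderBy(c => Levenshtein(spoken.ToLowerInvariant(), c.ToLowerInvariant())).First();

    private static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }
}

// Example: "joe jones" resolves to "Joseph Jones" among these candidates.
// var name = FuzzyMatcher.BestMatch("joe jones", new[] { "Joseph Jones", "Joan Joyce" });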

Dynamic Handler Compilation

Voice applications are sensitive to latency. A human used to conversing with another human may find delays in response from a computer voice application frustrating and unnatural. The present techniques include dynamic compilation of handlers, which results in several benefits. First, dynamic compilation results in performance gains during program execution. Rather than compiling code at runtime while the user is waiting for a response, code is compiled when the application initializes and is merely executed at runtime. FIG. 6 depicts an example GUI 600 for setting a dynamically-compiled handler. The GUI 600 includes a settings menu 602 including a plurality of setting keys, each having a respective setting value, creation date, and modification date.

The settings menu 602 includes a dynamic compilation key 604. When the user clicks on the dynamic compilation key 604, the user is provided with the value associated with the dynamic compilation key 604 in an edit value box 606. At any time, the user may modify the handler in the edit value box 606, and the application with which the handler is associated (e.g., the application 206) will recompile the updated handler and use it for voice control operations as discussed with FIG. 3. The depicted handler function in the edit value box 606 may correspond to, for example, the handler function for updating quantity discussed above with respect to FIG. 4, the HandleVoiceIntent function discussed with respect to FIG. 3, etc.
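One possible way to recompile an edited handler string at initialization, or when the setting value changes, is the Roslyn scripting API. This is an assumption about how dynamic compilation could be implemented, not a statement of the disclosed system's internals; the example handler source is likewise hypothetical.

using System;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

// Sketch: compile a handler body stored as a setting string into a callable
// delegate using Microsoft.CodeAnalysis.CSharp.Scripting.
public static class DynamicHandlerCompiler
{
    // handlerSource might come from the value edited in the edit value box 606, e.g.:
    //   "(Func<HandleVoiceIntentArgs, bool>)(args => { args.Speak(\"Quantity updated.\"); return true; })"
    public static async Task<Func<HandleVoiceIntentArgs, bool>> CompileAsync(string handlerSource)
    {
        var options = ScriptOptions.Default
            .AddReferences(typeof(HandleVoiceIntentArgs).Assembly)
            .AddImports("System");
        return await CSharpScript.EvaluateAsync<Func<HandleVoiceIntentArgs, bool>>(handlerSource, options);
    }
}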

In addition to providing more responsive runtime performance, another benefit of the dynamic compilation aspect of the present techniques is that handler functions may be edited while the application is deployed in a production environment, without requiring a redeployment. Thus, the dynamic compilation aspects are useful to programmers by reducing the computational resources required for operating enterprise environment deployments, simplifying developer workflow, and allowing the developer to debug in a real-world computing environment, such as the computing environment 100. Further, in the event of a software bug in a handler, or a change in business requirements, the developer can agilely update the handler code without performing a redeployment.

Visual Programming for Intent Dispatch

As noted above, conventional voice systems require developers to write code to implement functionality and, as such, do not allow end users to add voice commands. The present techniques overcome these limitations using, inter alia, the voice intent recognition, RPA, computer vision, and dynamic compilation techniques discussed herein. Specifically, the present techniques include voice integration using an action palette, wherein an end user (e.g., a business person) can map voice intents to actions to create custom voice-driven behaviors, without writing any code.

FIG. 7 depicts a flow diagram of a method 700 of visual programming for intent dispatch. The method 700 includes performing a speech-to-text operation on an utterance 702 (block 704). The speech-to-text operation at block 704 may be performed by, for example, the speech-to-text API 150 of FIG. 1. The speech-to-text operation at block 704 may generate a text string that is analyzed to recognize one or more intents and one or more respective entities (block 706). The entity and intent recognition may be performed by, for example, the command interpreter 152 of FIG. 1. The intent and entity may be mapped to one or more actions (block 708).

The method 700 may include determining whether an action involves a GUI (block 710). When one of the one or more actions is a GUI action, the method 700 may include mapping an action to a control (block 712). In some embodiments, mapping an intent or action to a control may use internal control names, as discussed above. However, in some embodiments, a CNN 714 may be used for mapping. The CNN 714 may correspond to the CNN 500 of FIG. 5, for example.

Action Palette

In some embodiments, the mapping of intents/entities to one or more actions at block 708 may be enabled by a mapping created by a user using an action palette tool. As seen in FIG. 8, an action palette GUI 800 includes a first window 802 and a second window 804 that includes one or more selected actions.

The first window 802 is an action palette menu, corresponding to an intent 806. For example, the intent 806 may relate to the utterance 702 of FIG. 7 and may have been detected by the intent/entity analysis at block 706 of FIG. 7. The first window 802 includes a set of actions 808-1, which may include GUI-based actions for performing RPA, in some embodiments. The first window 802 includes a set of actions 808-2, which may relate to a specific application, such as the application 206 of FIG. 2. The first window 802 includes a set of actions 808-3, which may include data operations that access a database, such as the database 130 of FIG. 1. The first window 802 includes a set of actions 808-4, which are process actions, such as sending an email. A user may select any of the actions listed in the sets of actions 808-1 through 808-4 to associate those actions with the intent 806.

The actions in the sets of actions 808-1 through 808-4 may be created, edited, and deleted by a user, such as an administrator or a business user. The actions may be shared among users. For example, a first business user may create a new action to automate a particularly repetitive or tiresome task. The GUI 800 may include instructions that allow the first user to share the action with a second business user, so that the second business user can use the action to automate the same task.

When the user selects an action, the GUI 800 may add the selected action to a list of selected actions in the second window 804. The GUI 800 may also allow the user to drag an action from the first window 802 to the second window 804 to select the action. The user may reorder the selected actions in the second window 804 via drag and drop. Specifically, in the depicted example, the user has selected an update label action 810, a find customer action 812, an update customer action 814, and a send email action 816. These actions compose the set of actions that will be performed in response to the intent 806. The user has ordered the actions in the second window 804 such that the first action performed is a find customer action 820, corresponding to the find customer action 812; the second action performed is an update customer action 822, corresponding to the update customer action 814; the third action performed is a send email action 824, corresponding to the send email action 816; and the fourth action performed is the update label action 826, corresponding to the update label action 810.

Once the user is satisfied with the selected actions and their ordering, the user may activate a save button 830 of the second window 804. The GUI may include instructions for saving the selected actions in association with the customer intent in the database 130. In operation, when a handler function receives an intent matching the intent 806 (e.g., UpdateCustomer), the handler function may query the database 130 to select the set of selected actions associated with the intent 806. The handler function may then execute each of the selected actions in the order established by the user in the second window 804.
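A sketch of that lookup-and-execute step follows. The IActionStore abstraction, the VoiceAction type, and its Execute call are hypothetical placeholders for the database 130 query and the palette actions; they are not the disclosed schema.

using System.Collections.Generic;

// Sketch: a handler looks up the user-ordered actions saved for an intent and
// executes them in order.
public record VoiceAction(int Order, string Name)
{
    public void Execute(HandleVoiceIntentArgs args) { /* run the palette action */ }
}

public interface IActionStore
{
    // Returns the actions saved for an intent, ordered as the user arranged them.
    IReadOnlyList<VoiceAction> GetActionsForIntent(string intentName);
}

public class PaletteDrivenHandler : IVoiceIntentHandler
{
    private readonly IActionStore _store;
    public PaletteDrivenHandler(IActionStore store) => _store = store;

    public bool HandleVoiceIntent(HandleVoiceIntentArgs args)
    {
        var actions = _store.GetActionsForIntent(args.IntentName); // e.g., "UpdateCustomer"
        foreach (var action in actions)
            action.Execute(args);                                  // find customer, update, email, label
        return actions.Count > 0;
    }
}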

It should be appreciated that the action palette GUI 800 of FIG. 8 harnesses the voice processing and intent determination techniques disclosed herein to provide an intuitive way for a user, including a non-technical user, to create new actions and map those actions to intents. The GUI 800 may have additional features and functionality, such as a wizard that walks the user through the process of adding a new voice-enabled command and associating that command with an action.

In some embodiments, a dynamic descriptor may be used to map intents to actions. For example, once an intent is determined, it may be advantageous to allow multiple GUI controls to be updated.

<intent name="Keyword">
  <action type="populateTextBox" controlLabel="KEYWORDS">
    <source entity="Keyword" singularize="true" />
  </action>
  <action type="selectItem" controlLabel="COMPANY">
    <source entity="CompanyCode" />
  </action>
  <action type="populateTextBox" controlLabel="FROM">
    <source entity="PriceFrom" />
  </action>
  <action type="populateTextBox" controlLabel="TO">
    <source entity="PriceTo" />
  </action>
</intent>

In the above example, the descriptor results in the population of multiple control elements. It should be appreciated that the descriptor follows screen labels, not internal names. Therefore, dynamic descriptors may be used to allow users to create their own functionality using natural names of screen components.

Further, it may be advantageous to perform a lookup in a database and to populate a GUI control based on a fuzzy match. A dynamic descriptor may be created as follows:

<intent name="CustomerName">
  <action type="populateTextBox" controlLabel="CUSTOMER">
    <source entity="CustomerName" />
    <lookup source="CustomerStore" fuzzy="true" />
  </action>
</intent>

In the above example, the descriptor specifies that the customer name will be retrieved from the customer database and, via fuzzy matching, the customer dropdown will be populated with the retrieved value. The dynamic descriptors may be dynamically compiled at runtime and stored, for example, in the settings menu 602 of FIG. 6.

In particular, a dynamic descriptor may be compiled and inserted into the intent listener chain. The descriptor may be changed freely to change, extend, or correct behaviors. The descriptor may be invoked by an intent handler when the intent (e.g., CustomerName) matches the descriptor. The compiled descriptor may be registered with a handler function and accessed from within the handler function in the context of processing an intent argument. It should be appreciated that the functionality implemented using dynamic descriptors could be implemented using other programming techniques.

For example, rather than using a descriptive markup language to specify a mapping between intents and actions, a programmer could write explicit code (e.g., using a series of IF . . . THEN statements). However, such code would result in a fixed series of steps to be taken. By instead describing the data to be matched, and the processing required, the present techniques allow data and functionality to be decoupled. Therefore, the functionality can be updated by modifying the descriptor without requiring any changes to the data (e.g., the intent/entity).

FIG. 9 depicts an example flow diagram of a method 900 for enabling voice functionality in an application, according to an embodiment. The method 900 may include receiving a handler registration request specifying an object handler to respond to voice commands (block 902). For example, an application such as the application 206 may access a class library including the classes depicted in the class diagram 300 of FIG. 3. The application may implement the IVoiceControlHub class 302 to access wake handling, voice channel subscription, and voice request handling. The application may set the wake mode (e.g., a wake word or keystroke combination) using the wake mode property 304. The application may implement the IVoiceIntentHandler class 306 to receive notifications when a relevant command arrives (e.g., a command matching a desired intent). The application may implement the IVoiceIntent class 312 to process intent and entity information. The application may represent intents and entities using the IVoiceIntentEntity class 314. The application may implement the IVoiceTranscription class 308 to access details for a voice command containing raw transcription details, and may implement the ISpeechSynth class 310 to generate speech-based responses to the user. The application may register a voice handler anywhere in the application, or in the constructor/initialization code of any object. A RegisterIntentHandler function allows the application to register an object (e.g., a function) that will respond to voice commands. In some embodiments, the method 900 compiles the object handler as a dynamic function.
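A short sketch of such a registration call appears below. The RegisterIntentHandler name and the wake mode property are named above; the exact members of IVoiceControlHub and the wake-mode value are illustrative assumptions, and the handler is the one from the earlier swipe-in sketch.

// Sketch: registering an intent handler with the voice control hub during
// application initialization. Member signatures are assumed, not disclosed.
public class ApplicationBootstrap
{
    public void EnableVoice(IVoiceControlHub hub)
    {
        // Wake mode property 304; the string value is an illustrative assumption.
        hub.WakeMode = "hey assistant";

        // RegisterIntentHandler registers an object that will respond to voice commands.
        hub.RegisterIntentHandler(new SwipeInIntentHandler());
    }
}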

As discussed above, the registered object may receive arguments including a reference to the user, an intent, an entity, etc. The registered object may inspect properties of the received object (e.g., an entity name) and take actions based on the inspected properties, such as modifying the status of a user record (e.g., a time clock), retrieving user information from a database, such as a human resources database (e.g., a user's vacation balance, manager identity, etc.), updating the status of a GUI, navigating a page, accessing another application (e.g., sending an email), etc. It should be appreciated that the intent handler can perform any suitable action. It should also be appreciated that, as discussed above, the intent handler may have access to a global context allowing the intent handler to preserve a state of the application over time. By preserving state, the intent handler may implement multi-turn interactions and other actions requiring information to be shared among multiple intent handlers. As discussed, the present techniques support dynamic compilation of handlers at runtime to improve performance and allow functionality to be edited at runtime, as discussed with respect to FIG. 6. The method 900 may include providing one or more of the classes of FIG. 3 as shared objects, for example, via the shared object module 210 of FIG. 2.

The method 900 may include receiving an utterance of a user (block 904). The utterance may be received via an input device coupled to a device on which the application is executing. For example, the input device may correspond to the input device 140 of FIG. 1. The input device may be a microphone that is integral to a device (e.g., a mobile computing device of the user) or a wired or wireless microphone attached to a stationary device such as a desktop computer. The method 900 may include performing audio management functions to make the input device available and to read data from the device. For example, the method 900 may (e.g., using the audio management module 212 of FIG. 2) initialize the audio device, load a software driver, query properties of the device, and process other events/properties of the device, as discussed herein. The method 900 may store an audio file of the received utterance (e.g., a WAV file, an MP3 file, etc.) in a storage device, such as the database 130 or the memory 114 of FIG. 1.

In an embodiment, the method 900 includes continually recording audio data from an environment, and continuously converting the audio data to text via speech-to-text functionality. The method 900 may continually analyze the text until a wake word is found. Then, in response to the presence of the wake word, any successive utterances may be interpreted as a command, and utterances prior to the wake word may be discarded or temporarily stored for later analysis. In another embodiment, the method 900 may record audio data of an utterance in response to a key press.
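A simplified sketch of that wake-word gating loop over a stream of transcribed text follows. The transcript source, the wake word, and the command callback are assumptions for illustration.

using System;
using System.Collections.Generic;

// Sketch: transcribed text is discarded until the wake word appears; the
// remainder of that utterance is treated as a command.
public class WakeWordGate
{
    private readonly string _wakeWord;
    private readonly Action<string> _onCommand;

    public WakeWordGate(string wakeWord, Action<string> onCommand)
    {
        _wakeWord = wakeWord;
        _onCommand = onCommand;
    }

    // transcripts: a continuous stream of speech-to-text output segments.
    public void Run(IEnumerable<string> transcripts)
    {
        foreach (var text in transcripts)
        {
            int idx = text.IndexOf(_wakeWord, StringComparison.OrdinalIgnoreCase);
            if (idx < 0)
                continue;                                  // no wake word: discard
            var command = text.Substring(idx + _wakeWord.Length).Trim();
            if (command.Length > 0)
                _onCommand(command);                       // interpret as a command
        }
    }
}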

The method 900 may include transmitting the utterance of the user to a remote cloud services layer (block 906). For example, the method 900 may use the NIC 112 to transmit the utterance via the network 104 to the cloud 106 of FIG. 1 or the cloud services layer 202 of FIG. 2. Specifically, the command handling module 218 of FIG. 2 may submit an HTTP POST request including the audio data of the utterance as a payload to an API, such as the speech-to-text API 150 of FIG. 1. In an embodiment, the API may be an API of an intermediate server located in the cloud that acts as a proxy or reverse proxy between the cloud and the voice control hub 204.
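A minimal sketch of such an HTTP POST, assuming a WAV payload and a hypothetical endpoint URL; the actual contract of the speech-to-text API 150 is not specified here.

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Sketch: POST the recorded utterance (WAV bytes) to a speech-to-text endpoint.
public static class UtteranceUploader
{
    private static readonly HttpClient Client = new HttpClient();

    public static async Task<string> TranscribeAsync(byte[] wavBytes)
    {
        var content = new ByteArrayContent(wavBytes);
        content.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");
        var response = await Client.PostAsync("https://cloud.example.com/speech-to-text", content);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();   // transcription text or JSON
    }
}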

The method 900 may include receiving or retrieving an intent and an entity from the remote cloud services layer, wherein the intent is associated with the entity (block 908). For example, as discussed above, the command handling module 218 may receive a JSON response from the cloud (e.g., from the command interpreter 152) that includes one or more intents, wherein each intent is associated with one or more entities.
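A small sketch of extracting the first intent and its entities from such a JSON response. The property names ("intent", "entities", "type", "value") are assumptions about the payload shape, not the command interpreter 152's actual schema.

using System.Collections.Generic;
using System.Text.Json;

// Sketch: parse an intent and its entities from a JSON response.
public static class IntentResponseParser
{
    public static (string Intent, Dictionary<string, string> Entities) Parse(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        var intent = root.GetProperty("intent").GetString();

        var entities = new Dictionary<string, string>();
        foreach (var e in root.GetProperty("entities").EnumerateArray())
            entities[e.GetProperty("type").GetString()] = e.GetProperty("value").GetString();

        return (intent, entities);
    }
}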

The method 900 may include dispatching the intent and the entity to the object handler (block 910). In some embodiments, the method 900 may dispatch every intent/entity received to each registered event handler. In some embodiments, the method 900 may dispatch the intent/entity received only to those event handlers subscribed to a channel associated with the intent and/or entity. The method 900 may also store the intent/entity in a database, such as the database 130 of FIG. 1. It should be appreciated that dispatching the intent and entity may include the method 900 transferring the intent/entity to another computing device, in some cases via an electronic network.

In an embodiment, the method 900 receives a channel subscription specifying a channel and one or both of (i) an intent type, and (ii) an entity type. The method 900 stores channel subscription information (e.g., the channel name, a socket address of the subscriber, etc.) in a database. Subsequently, when an intent/entity is received or retrieved, the method 900 dispatches the intent and entity to the channel based on the channel subscription information.
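A sketch of channel-based dispatch under those assumptions. The ChannelSubscription shape and the in-process dispatcher are hypothetical; the cross-device, networked case described above is not shown.

using System.Collections.Generic;
using System.Linq;

// Sketch: dispatch an intent/entity only to handlers whose subscription matches.
public record ChannelSubscription(string Channel, string IntentType, IVoiceIntentHandler Handler);

public class ChannelDispatcher
{
    private readonly List<ChannelSubscription> _subscriptions = new List<ChannelSubscription>();

    public void Subscribe(ChannelSubscription subscription) => _subscriptions.Add(subscription);

    public void Dispatch(HandleVoiceIntentArgs args)
    {
        // Send the intent only to handlers whose subscribed intent type matches;
        // the Channel value could equally drive the match in other embodiments.
        foreach (var sub in _subscriptions.Where(s => s.IntentType == args.IntentName))
            sub.Handler.HandleVoiceIntent(args);
    }
}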

Of course, the applications and benefits of the systems, methods, and techniques described herein are not limited to only the above examples. Many other applications and benefits are possible by using the systems, methods, and techniques described herein.

Furthermore, when implemented, any of the methods and techniques described herein, or portions thereof, may be performed by executing software stored in one or more non-transitory, tangible, computer-readable storage media or memories such as magnetic disks, laser disks, optical discs, semiconductor memories, biological memories, other memory devices, or other storage media, in a RAM or ROM of a computer or processor, etc.

Moreover, although the foregoing text sets forth a detailed description of numerous different embodiments, it should be understood that the scope of the patent is defined by the words of the claims set forth at the end of this patent. The detailed description is to be construed as exemplary only and does not describe every possible embodiment because describing every possible embodiment would be impractical, if not impossible. Numerous alternative embodiments could be implemented, using either current technology or technology developed after the filing date of this patent, which would still fall within the scope of the claims. By way of example, and not limitation, the disclosure herein contemplates at least the following aspects:

1. A voice control hub computing system, comprising one or more processors, and a memory containing instructions that, when executed, cause the voice control hub computing system to: receive a handler registration request specifying an object handler to respond to voice commands, receive an utterance of a user, transmit the utterance of the user to a remote cloud services layer, receive an intent and an entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatch the intent and the entity to the object handler.

2. The voice control hub computing system of aspect 1, including further instructions that, when executed, cause the voice control hub to: receive a text string representing speech-to-text output from the remote cloud services layer.

3. The voice control hub computing system of aspect 1, including further instructions that, when executed, cause the voice control hub to: receive a channel subscription specifying a channel and one or both of (i) an intent type, and (ii) an entity type, and based on the channel subscription, dispatch the intent and the entity to the channel.

4. The voice control hub computing system of any one of aspects 1 through 3, wherein the object handler is a dynamically compiled function.

5. The voice control hub computing system of any one of aspects 1 through 4, wherein the utterance of the user is received in response to a wake word utterance of the user.

6. The voice control hub computing system of aspect 1, including further instructions that, when executed, cause the voice control hub to: synthesize a speech response to the user responsive to the utterance of the user, and cause the speech response to be output in an audio speaker of a computing device of the user.

7. The voice control hub computing system of aspect 1, including further instructions that, when executed, cause the voice control hub to: set a value in a global context visible to the object handler.

8. The voice control hub computing system of any one of aspects 1 through 7, wherein the voice control hub computing system is packaged as a shared object that an application can access to enable voice functionality in the application.

9. A computer-implemented method for enabling voice functionality in an application, comprising: receiving a handler registration request specifying an object handler to respond to voice commands, receiving an utterance of a user, transmitting the utterance of the user to a remote cloud services layer, receiving an intent and an entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatching the intent and the entity to the object handler.

10. The computer-implemented method of aspect 9, further comprising: receiving a text string representing speech-to-text output from the remote cloud services layer.

11. The computer-implemented method of aspect 9, further comprising: receiving a channel subscription specifying a channel and one or both of (i) an intent type, and (ii) an entity type, and based on the channel subscription, dispatching the intent and the entity to the channel.

12. The computer-implemented method of aspect 9, further comprising: compiling the object handler as a dynamic function.

13. The computer-implemented method of aspect 9, further comprising: receiving the utterance of the user in response to the user uttering a wake word.

14. The computer-implemented method of aspect 9, further comprising: synthesizing a speech response to the user responsive to the utterance of the user, and causing the speech response to be output in an audio speaker of a computing device of the user.

15. The computer-implemented method of any one of aspects 9 through 14, wherein synthesizing the speech response to the user responsive to the utterance of the user is part of a multi-turn interaction with the user.

16. A non-transitory computer readable medium containing program instructions that when executed, cause a computer to: receive a handler registration request specifying an object handler to respond to voice commands, receive an utterance of a user, transmit the utterance of the user to a remote cloud services layer, receive an intent and an entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatch the intent and the entity to the object handler.

17. The non-transitory computer readable medium of aspect 16 containing further program instructions that when executed, cause a computer to: receive a text string representing speech-to-text output from the remote cloud services layer.

18. The non-transitory computer readable medium of aspect 16 containing further program instructions that when executed, cause a computer to: receive a channel subscription specifying a channel and one or both of (i) an intent type, and (ii) an entity type, and based on the channel subscription, dispatch the intent and the entity to the channel.

19. The non-transitory computer readable medium of aspect 16 containing further program instructions that when executed, cause a computer to: synthesize a speech response to the user responsive to the utterance of the user, and cause the speech response to be output in an audio speaker of a computing device of the user.

20. The non-transitory computer readable medium of aspect 16 containing further program instructions that when executed, cause a computer to: set a value in a global context visible to the object handler.

Additional Considerations

The above voice-enabled functionality works in any application context. There is no need for a user to enter a particular dictation mode context, as with a conventional speech recognition program, although the methods and systems do support an explicit dictation context. The above techniques allow a user to enter a voice recognition mode concurrently with another task, such as editing an email, or seamlessly navigating from one application to another.

The following considerations also apply to the foregoing discussion. Throughout this specification, plural instances may implement operations or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

It should also be understood that, unless a term is expressly defined in this patent using the sentence “As used herein, the term ‘____’ is hereby defined to mean . . .” or a similar sentence, there is no intent to limit the meaning of that term, either expressly or by implication, beyond its plain or ordinary meaning, and such term should not be interpreted to be limited in scope based on any statement made in any section of this patent (other than the language of the claims). To the extent that any term recited in the claims at the end of this patent is referred to in this patent in a manner consistent with a single meaning, that is done for the sake of clarity only so as to not confuse the reader, and it is not intended that such claim term be limited, by implication or otherwise, to that single meaning. Finally, unless a claim element is defined by reciting the word “means” and a function without the recital of any structure, it is not intended that the scope of any claim element be interpreted based on the application of 35 U.S.C. § 112(f).

Unless specifically stated otherwise, discussions herein using words such as “processing,” “computing,” “calculating,” “determining,” “presenting,” “displaying,” or the like may refer to actions or processes of a machine (e.g., a computer) that manipulates or transforms data represented as physical (e.g., electronic, magnetic, or optical) quantities within one or more memories (e.g., volatile memory, non-volatile memory, or a combination thereof), registers, or other machine components that receive, store, transmit, or display information.

As used herein, any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

In addition, use of “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.

Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for implementing the concepts disclosed herein, through the principles disclosed herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various modifications, changes, and variations, which will be apparent to those skilled in the art, may be made in the arrangement, operation, and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims.

1. A voice control hub computing system for performing a task within an enterprise business software application, comprising one or more processors, and a memory containing instructions that, when executed, cause the voice control hub computing system to: receive a handler registration request specifying an object handler to respond to voice commands, receive an utterance of a user of the enterprise business software application, transmit the utterance of the user to a remote cloud services layer, convert the utterance of the user to a text string representing speech-to-text output using a custom speech model, analyze the text string using one or more trained machine learning models to generate an intent and an entity corresponding to the task, receive the intent and the entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatch the intent and the entity to the object handler.
2. The voice control hub computing system of claim 1, including further instructions that, when executed, cause the voice control hub to: receive the text string representing speech-to-text output from the remote cloud services layer.
3. The voice control hub computing system of claim 1, including further instructions that, when executed, cause the voice control hub to: receive a channel subscription specifying a channel and one or both of (i) the intent type, and (ii) the entity type, and based on the channel subscription, dispatch the intent and the entity to the channel.
4. The voice control hub computing system of claim 1, wherein the object handler is a dynamically compiled function.
5. The voice control hub computing system of claim 1, wherein the utterance of the user is received in response to a wake word utterance of the user.
6. The voice control hub computing system of claim 1, including further instructions that, when executed, cause the voice control hub to: synthesize a speech response to the user responsive to the utterance of the user, and cause the speech response to be output in an audio speaker of a computing device of the user.
7. The voice control hub computing system of claim 1, including further instructions that, when executed, cause the voice control hub to: set a value in a global context visible to the object handler.
8. The voice control hub computing system of claim 1, wherein the voice control hub computing system is packaged as a shared object that the enterprise business software application can access to enable voice functionality in the application.
9. A computer-implemented method for enabling voice functionality in an enterprise business software application, comprising: receiving a handler registration request specifying an object handler to respond to voice commands, receiving an utterance of a user of the enterprise business software application, transmitting the utterance of the user to a remote cloud services layer, converting the utterance of the user to a text string representing speech-to-text output using a custom speech model, analyzing the text string using one or more trained machine learning models to generate an intent and an entity corresponding to a task, receiving the intent and the entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatching the intent and the entity to the object handler.
10. The computer-implemented method of claim 9, further comprising: receiving the text string representing speech-to-text output from the remote cloud services layer.
11. The computer-implemented method of claim 9, further comprising: receiving a channel subscription specifying a channel and one or both of (i) the intent type, and (ii) the entity type, and based on the channel subscription, dispatching the intent and the entity to the channel.
12. The computer-implemented method of claim 9, further comprising: compiling the object handler as a dynamic function.
13. The computer-implemented method of claim 9, further comprising: receiving the utterance of the user in response to the user uttering a wake word.
14. The computer-implemented method of claim 9, further comprising: synthesizing a speech response to the user responsive to the utterance of the user, and causing the speech response to be output in an audio speaker of a computing device of the user.
15. The computer-implemented method of claim 14, wherein synthesizing the speech response to the user responsive to the utterance of the user is part of a multi-turn interaction with the user.
16. A non-transitory computer readable medium containing program instructions that when executed, cause a computer to: receive a handler registration request specifying an object handler to respond to voice commands, receive an utterance of a user of the enterprise business software application, transmit the utterance of the user to a remote cloud services layer, convert the utterance of the user to a text string representing speech-to-text output using a custom speech model, analyze the text string using one or more trained machine learning models to generate an intent and an entity corresponding to a task, receive the intent and the entity from the remote cloud services layer, wherein the intent is associated with the entity, and dispatch the intent and the entity to the object handler.
17. The non-transitory computer readable medium of claim 16 containing further program instructions that when executed, cause a computer to: receive the text string representing speech-to-text output from the remote cloud services layer.
18. The non-transitory computer readable medium of claim 16 containing further program instructions that when executed, cause a computer to: receive a channel subscription specifying a channel and one or both of (i) the intent type, and (ii) the entity type, and based on the channel subscription, dispatch the intent and the entity to the channel.
19. The non-transitory computer readable medium of claim 16 containing further program instructions that when executed, cause a computer to: synthesize a speech response to the user responsive to the utterance of the user, and cause the speech response to be output in an audio speaker of a computing device of the user.
20. The non-transitory computer readable medium of claim 16 containing further program instructions that when executed, cause a computer to: set a value in a global context visible to the object handler.